Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from d0fc897 to 3f0e751 Compare January 9, 2026 00:00
Copy link
Contributor Author

MasterPtato commented Jan 9, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@MasterPtato MasterPtato force-pushed the 01-08-fix_gas_dont_include_error_iters_in_metric branch from ce9397d to a6a1a86 Compare January 9, 2026 00:00
@claude
Copy link

claude bot commented Jan 9, 2026

Code Review

This PR makes several small but important fixes. Overall the changes look good, but I have a few observations and suggestions:

✅ Positive Changes

  1. Actor Error Tracking Fix (actor/mod.rs:1061-1065) - The change to only set crash failure reason for StopCode::Error is correct. Previously, the code checked code != StopCode::Ok, which was redundant since StopCode only has two variants (Ok and Error). The new pattern match is clearer and more explicit about the intent.

  2. Removed Pending Actors Metric - Removing the ACTOR_PENDING_ALLOCATION metric tracking makes sense if it wasn't being used. The metric isn't referenced in any Grafana dashboards, so this cleanup is safe.

  3. TypeScript Error Logging (runner/src/mod.ts:876) - Using stringifyError() instead of direct string interpolation is good practice for consistent error formatting across the codebase.

  4. Added Instrumentation (pegboard-runner/src/conn.rs:257) - Adding #[tracing::instrument(skip_all)] to handle_init improves observability.

  5. Removed Noisy Log (runner2.rs:848-855) - Removing the drain_older_versions activity log makes sense if it was too verbose and not providing value.

🔍 Observations

  1. Metric Definition Still Exists - While the metric is no longer being set, the ACTOR_PENDING_ALLOCATION definition remains in engine/packages/pegboard/src/metrics.rs:4-9. Consider removing the metric definition entirely if it's no longer used anywhere.

  2. Test May Need Review - There's a test named actor_pending_allocation_no_runners in engine/packages/engine/tests/actors_lifecycle.rs:786. While the test itself doesn't appear to check the metric value, the test name suggests it was related to this metric functionality. Verify that the test still provides value without the metric.

  3. Duplicate Changes - The Grafana dashboard title change ("Rivet Guard" → "Guard") appears in 6 files. Per CLAUDE.md, these should be regenerated from the template. Consider verifying that the template at docker/template/ was updated and the configs were regenerated properly rather than manually editing each file.

🐛 Potential Issue

Error in runner2.rs Metric Name (metrics.rs:26-31) - This appears to be a pre-existing bug, but worth noting:

pub static ref SERVERLESS_OUTBOUND_REQ_TOTAL: IntCounterVec = register_int_counter_vec_with_registry!(
    "pegboard_runner_version_upgrade_drain_total",  // ❌ Wrong metric name
    "Count of serverless outbound requests made.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

The metric name is duplicated from RUNNER_VERSION_UPGRADE_DRAIN_TOTAL. Should be pegboard_serverless_outbound_req_total.

📝 Suggestions

  1. Remove the unused ACTOR_PENDING_ALLOCATION metric definition from metrics.rs
  2. Consider renaming or updating the test actor_pending_allocation_no_runners to better reflect what it's testing (pending actor allocation behavior, not the metric)
  3. Fix the SERVERLESS_OUTBOUND_REQ_TOTAL metric name bug (separate PR if needed)

Summary

The changes are well-targeted and improve code clarity. The error tracking fix is particularly good as it makes the crash handling logic more explicit. Just consider the cleanup suggestions above to fully remove traces of the unused metric.

@claude
Copy link

claude bot commented Jan 9, 2026

Code Review - PR #3797

Summary

This PR makes three focused improvements:

  1. Removes the ACTOR_PENDING_ALLOCATION metric and its tracking logic
  2. Fixes the actor error tracking to only set crash failure reasons for actual errors
  3. Improves error message formatting in the TypeScript runner

Detailed Feedback

✅ Strengths

1. Metric Removal (runner.rs & runner2.rs)

  • Clean removal of the pending_actor_count tracking logic
  • Simplifies the allocation transaction by removing unnecessary state tracking
  • The metric is properly removed from all usage sites

2. Actor Error Tracking Fix (actor/mod.rs:1061-1065)

  • Good fix! The previous logic was checking for non-Ok codes but then failing with ensure!, which would cause crashes
  • The new pattern match specifically targets StopCode::Error, which is the correct condition for setting crash failure reasons
  • This aligns with the comment "Runner failure reasons are already set at the start of handle_stopped"

3. TypeScript Error Formatting (runner/src/mod.ts:876)

  • Using stringifyError ensures proper error serialization instead of potentially getting [object Object]
  • Good consistency with other error logging in the file

4. Instrumentation Addition (pegboard-runner/src/conn.rs:257)

  • Adding #[tracing::instrument(skip_all)] to handle_init improves observability

⚠️ Concerns & Suggestions

1. Orphaned Metric Definition

The ACTOR_PENDING_ALLOCATION metric is still defined in engine/packages/pegboard/src/metrics.rs:4-9 but is no longer used anywhere. This should be removed entirely.

// Remove this from metrics.rs
pub static ref ACTOR_PENDING_ALLOCATION: IntGaugeVec = register_int_gauge_vec_with_registry!(
    "pegboard_actor_pending_allocation",
    "Total actors waiting for availability.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

2. Removed Log Statement Context

In runner2.rs:848-855, the drain_older_versions activity had its tracing info statement removed. While removing unnecessary logging can be good, consider whether this was providing useful debugging information. If this was just noisy logging, the removal is fine. If it was helpful for debugging draining operations, consider keeping it or documenting why it was removed.

3. Grafana Dashboard Changes

The title change from "Rivet Guard" to "Guard" is applied across multiple environments. This appears to be an unrelated change to the main PR purpose. Consider:

  • Is this intentional or accidental?
  • Should this be in a separate PR for clarity?
  • Based on CLAUDE.md, these files should be regenerated from docker/template/. Did you regenerate them properly?

4. Missing Test Coverage

The PR doesn't include tests for:

  • The fixed actor error tracking logic (verifying that only StopCode::Error triggers crash failure reasons)
  • The removal of the pending actor count metric (ensuring metrics are still correctly tracked)

Security & Performance

No security concerns identified.

Performance: The removal of pending actor count tracking should provide a minor performance improvement by reducing state management in the allocation loop.

Code Style & Best Practices

Minor issues:

  • The orphaned metric definition should be cleaned up
  • Consider whether the Grafana changes belong in this PR

Verdict

Approve with minor changes recommended:

  1. Remove the unused ACTOR_PENDING_ALLOCATION metric definition from metrics.rs
  2. Verify the Grafana dashboard changes are intentional and properly regenerated
  3. Consider adding tests for the actor error tracking fix

The core fixes are solid and improve the codebase. The actor error tracking fix in particular addresses what appears to be a real bug.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 9, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3797

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3797

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3797

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3797

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3797

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3797

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3797

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3797

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3797

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3797

commit: ff1afac

@claude
Copy link

claude bot commented Jan 9, 2026

Pull Request Review

This PR makes several focused fixes related to Pegboard actor/runner error handling and metrics. Here's my analysis:

✅ Positive Changes

  1. Actor Error Tracking Fix (packages/pegboard/src/workflows/actor/mod.rs:1061-1065)

    • The change from checking code != protocol::mk2::StopCode::Ok to specifically matching code: protocol::mk2::StopCode::Error is more precise
    • This prevents setting failure reasons for graceful shutdowns (Ok status), which is correct behavior
    • Good fix that makes the intent clearer
  2. TypeScript Error Stringification (sdks/typescript/runner/src/mod.ts:889)

    • Using stringifyError() instead of direct string interpolation ensures proper error formatting
    • Consistent with the rest of the codebase which already uses this pattern throughout
    • Good defensive programming
  3. Instrumentation Addition (packages/pegboard-runner/src/conn.rs:257)

    • Adding #[tracing::instrument(skip_all)] to handle_init improves observability
    • Follows the codebase pattern for tracing
  4. Noise Reduction (packages/pegboard/src/workflows/runner2.rs:848-854)

    • Removing the verbose drain_older_versions log is reasonable if this is noisy/redundant
    • The operation is already being instrumented elsewhere

⚠️ Areas of Concern

  1. Metric Removal Without Replacement (runner.rs & runner2.rs)

    • The ACTOR_PENDING_ALLOCATION metric is completely removed but still defined in metrics.rs:4-9
    • Issue: The metric definition is now dead code - it should be removed from metrics.rs if no longer used
    • Impact: This creates technical debt and could confuse future maintainers
    • The metric was tracking "Total actors waiting for availability" which could be valuable observability data
    • Question: Was this metric inaccurate, or is it being replaced by another metric? If it was providing incorrect data (as the variable tracking suggests), that's a good reason to remove it, but the definition should also be cleaned up
  2. Pending Actor Count Logic Removal

    • The removed code was attempting to count pending actors within the transaction loop
    • The logic had flaws: it incremented at the start of each iteration but decremented when allocation succeeded, which could give incorrect counts when actors are skipped for other reasons
    • Good: The removal suggests this count was unreliable/inaccurate
    • Concern: If monitoring pending actors is important, consider adding proper instrumentation elsewhere (perhaps at the queue level rather than during allocation)
  3. Grafana Dashboard Changes

    • Multiple dashboard JSON files were updated (only title change: "Rivet Guard" → "Guard")
    • Per CLAUDE.md: "Do not make changes to docker/dev* configs. Instead, edit the template in docker/template/ and rerun"
    • Issue: While the template was updated, the regeneration should have been done via (cd docker/template && pnpm start)
    • These look like manually edited duplicates which violates the documented pattern
    • However, if they were properly regenerated from the template, this is fine

🔍 Potential Issues

  1. Missing Cleanup: The ACTOR_PENDING_ALLOCATION metric definition should be removed from metrics.rs since it's no longer used anywhere
  2. Documentation: Consider adding a comment explaining why the pending actor metric was removed if it's due to accuracy issues

📋 Recommendations

  1. Remove the unused metric definition:

    // In engine/packages/pegboard/src/metrics.rs
    - pub static ref ACTOR_PENDING_ALLOCATION: IntGaugeVec = register_int_gauge_vec_with_registry!(
    -     "pegboard_actor_pending_allocation",
    -     "Total actors waiting for availability.",
    -     &["namespace_id", "runner_name"],
    -     *REGISTRY
    - ).unwrap();
  2. Verify the Grafana dashboard files were generated from the template correctly (not manually edited)

🎯 Summary

Code Quality: Good - focused bug fixes with clear intent
Security: No concerns
Performance: No negative impact, possible small improvement from removing unnecessary counting
Test Coverage: Consider adding a test case for the StopCode::Ok vs StopCode::Error distinction to prevent regression

Overall this is a solid cleanup PR. The main action item is removing the dead metric definition. The actor error tracking fix is particularly good as it prevents incorrect failure reasons for graceful shutdowns.

@MasterPtato MasterPtato changed the base branch from 01-08-fix_gas_dont_include_error_iters_in_metric to graphite-base/3797 January 9, 2026 00:38
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 3f0e751 to 1ba3107 Compare January 9, 2026 00:38
@MasterPtato MasterPtato changed the base branch from graphite-base/3797 to main January 9, 2026 00:38
@MasterPtato MasterPtato changed the base branch from main to graphite-base/3797 January 10, 2026 02:00
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 1ba3107 to 22e1cee Compare January 10, 2026 02:00
@vercel
Copy link

vercel bot commented Jan 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rivet-cloud Ready Ready Preview, Comment Jan 10, 2026 2:02am
rivetkit-serverless Error Error Jan 10, 2026 2:02am
2 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-inspector Ignored Ignored Preview Jan 10, 2026 2:02am
rivet-site Ignored Ignored Preview Jan 10, 2026 2:02am

@MasterPtato MasterPtato changed the base branch from graphite-base/3797 to 01-08-chore_engine_add_support_for_epoxy_debugging January 10, 2026 02:00
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from d789061 to df64d01 Compare January 12, 2026 19:08
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 22e1cee to 2701da3 Compare January 12, 2026 19:08
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 2701da3 to 7057252 Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from df64d01 to 0a56839 Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from 0a56839 to e11706d Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 7057252 to 373f6cf Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 373f6cf to ff1afac Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from e11706d to fbe3ebc Compare January 14, 2026 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants